\documentclass{article}
\usepackage{algorithmicx,algorithm,algpseudocode,booktabs,url,amstext}

\renewcommand{\cite}[1]{}

\begin{document}

\begin{abstract}
  In this paper, we investigate the potential of GPUs for performing link
  structure analysis of social graphs. Specifically, we implement Twitter's
  WTF (``Who to Follow'') recommendation system on a single GPU\@. Our
  implementation shows promising results on moderate-sized social graphs. It
  can return the top-K relevant users for a single user in 172~ms when
  running on a subset of the 2009 Twitter follow graph with 16 million users
  and 85 million social relations. For our largest dataset, which contains
  75\% of the users (30 million) and 50\% of the social relations (680
  million) of the complete follow graph, this calculation takes 1.0~s. We
  also propose possible solutions to apply our system to follow graphs of
  larger sizes that do not fit into the on-board memory of a single GPU\@.
\end{abstract}

\section{Introduction}

Many web service providers use recommendation systems to sell products or
to increase the value of their service. Shopping services like Amazon
suggest products users may be interested in buying, news sites recommend
articles based on a user's reading history, and streaming services like
Netflix recommend movies and television shows to watch. Social networking
services recommend people that the current user may want to connect with.
In a social network, where the content is entirely user-generated, a good
recommendation system is key in retaining and engaging users.

Twitter is a social media service that allows users to broadcast short,
140-character \emph{tweets} to all users who choose to \emph{follow} them.
Twitter's success depends on acquiring and maintaining an active user base.
A user's satisfaction with the service depends almost entirely on the
content generated by the other users in their social network. They need to
subscribe to the tweets of users they are interested in reading updates
from. Finding users to follow can be difficult, especially for new users,
so Twitter provides them with recommendations of accounts they may be
interested in subscribing to. This service is called ``Who to Follow'', or
WTF\@. Twitter calculates these recommendations by analyzing the link
structure of the follow graph~\cite{Gupta:2013:WTW}. Recommendations should
be provided and updated in real time to keep up with changes in the
follower graph. For a graph with hundreds of millions of nodes and billions
of edges, this is no small problem. Graphics processing units (GPUs) have a
highly parallel architecture that is potentially well-suited for this kind
of large-scale graph analysis.

\begin{figure*}
  \centering
  \caption{Overview of Twitter's WTF algorithm. \emph{Frame 1:} The initial graph (red
      [dark] node is the user for whom recommendations are being computed).
    \emph{Frame 2:} The Circle of Trust (nodes in pink [dark]) is found using
    Personalized PageRank. \emph{Frame 3:} The graph is pruned to include only
    the CoT and the users they follow. \emph{Frame 4:} The relevance scores of
    all users on the right side are computed with Twitter's Money algorithm.
    Node 4 will be suggested for Node 0 to follow because it has the highest
    value of all nodes the user is not already following.}
\end{figure*}

GPUs are throughput-oriented, highly parallel architectures that have
proven to be very efficient for solving many large-scale problems with
plenty of available parallelism. The modern GPU has become the most
cost-effective way to perform massive amounts of computation. A single GPU
can often outperform a cluster of CPUs; however, they have not yet been
able to gain a foothold in data centers. This is due to various issues,
such as programmability, limited memory size, and communication costs. GPUs
have a very different architecture from CPUs. This makes GPUs efficient at
solving large problems, but also creates unique programming challenges.

Solving graph problems on a GPU is particularly difficult because although
these problems usually have a large amount of parallelism, that parallelism
is irregular: in social graphs, nodes typically have widely varying
connectivity and thus wildly varying work per node. As well, graph
algorithms usually involve traversing the graph, including paths that split
and join along the way; this complicates the control flow, because paths
could intersect at any time. Nonetheless, some work has been done on
implementing PageRank, a popular graph analysis algorithm, on
GPUs~\cite{Rungsawang:2012:FPC,Wu:2010:EPS}; however, this work does not
confront the irregular challenges of graph traversal because they use the
linear algebra formulation of the PageRank algorithm. GPUs can work well on
native graph representations: Merrill et al.\ used GPUs to achieve 4x
speedups over multicore CPUs with breadth-first
search~\cite{Merrill:2012:SGG}. Our work leverages some of Merrill et al.'s
traversal strategies, but maps a more complex graph algorithm to GPUs. Very
little work has been done using GPUs for recommendation systems. Srinivasa
et al.\ implemented the friends-of-friends algorithm on a
GPU~\cite{Srinivasa:2013:GIO}. Friends-of-friends is a basic recommendation
algorithm that takes a user and nodes adjacent to the user and returns the
set of second level nodes that are adjacent to these nodes. Srinivasa et
al.\ only tested their algorithm on small graphs (fewer than 40,000 nodes),
but they did achieve a speedup of about 2x over a basic CPU implementation.

\section{Twitter's WTF Algorithm}

Two main stages comprise the WTF recommender. In the first stage, Twitter
calculates a Personalized PageRank (PPR) for the user. PPR assigns a
ranking to all nodes in the network based on how closely connected they are
to the main node (the user interested in recommendations). This ranking is
used to find the top 1000 ranking nodes. These nodes form the user's
``Circle of Trust'' (CoT), which consists of the 1000 nodes closest to the
user. Pruning the graph in this way increases the personalization of the
recommendation and reduces spam. Next, they create a bipartite graph with
the CoT on one side, and the users followed by the CoT on the other. All
other nodes are pruned from the graph. The final step is Twitter's
``Money'' algorithm, a graph analysis algorithm that determines which
accounts the user is most likely to be interested in following. Figure 1
shows a schematic of the entire WTF algorithm.

\subsection{Personalized PageRank}
PageRank is an algorithm for ranking nodes in a graph based on the
structure of the graph~\cite{Page:1999:TPC}. It was originally used on the
web graph to rank webpages for search engines. The PageRank of a node can
be thought of as the probability that a random walker traversing the graph
along the edges will land on that node. It is a measure of how
well-connected the node is, and therefore, how important it is to the
graph. A Personalized PageRank (PPR) calculation relative to node $A$ is
identical to the normal PageRank calculation, except all random walks begin
at node $A$, rather than a random node. Overall, a Personalized PageRank
calculation for $A$ shows which nodes are most closely related to $A$\@.

PageRank can be calculated using Monte Carlo methods or power iteration. A
Monte Carlo method for Personalized PageRank would be to actually perform
many random walks on the graph and maintain a count of the number of times
each node is visited, then use these counts to estimate the stationary
distribution. Power iteration methods formulate the problem as a system of
linear equations and use linear algebra techniques to solve for the ranking
values. Twitter chose a Monte Carlo method for their PPR, while we chose a
power iteration method for our implementation. We discuss the reasoning for
this decision in Section~\ref{sec:discussion}.

\subsection{Money}
Twitter's Money algorithm~\cite{Goel:2014:TWT} is similar to a combination
of Kleinberg's HITS algorithm~\cite{Kleinberg:1999:ASH} and Lempel and
Moran's SALSA~\cite{Lempel:2001:STS}. First, the graph is transformed into
a bipartite graph, with the CoT on the left side, and the users the CoT
follows on the right side, as shown in the second and third stages of
Figure 1. If a user is both in the CoT and followed by someone in the CoT,
this node will appear on both sides of the bipartite graph. The CoT nodes
are assigned a \emph{similarity} value, and the users they follow are
assigned a \emph{relevance} value. Initially, the user we are trying to get
recommendations for, $C$, has their similarity score set to 1, and all
others have their similarity or relevance scores set to 0. The Money
algorithm distributes each CoT member's similarity score to the relevance
scores of all the users they follow. The followers then distribute their
relevance scores back across the graph to all of their followers. As in
PPR, the Money algorithm can also be written as a system of linear
equations. From the solution of this system of equations, Twitter finds the
nodes with the highest relevance scores. These are the accounts they
recommend to the user in the Who to Follow feature. The similarity scores
are used for other features, such as the ``similar to you'' feature and
targeted advertising.

\section{Parallelizing Graph Algorithms on the GPU}

Both PPR and Money are link prediction algorithms that traverse the graph
and assign rank values for a subset of nodes. Like many other graph
algorithms, they can be viewed as iterative convergent processes. There is
a data dependency between iterative steps, but within each step, a large
number of independent edge-centric or vertex-centric operations can be
parallelized. However, to fully exploit the compute capabilities of GPUs,
we need special strategies to handle irregular memory access and work
distribution.

For compact and efficient memory access, we use the compressed sparse row
(CSR) format to represent the follow graph on the GPU\@. It uses a
column-indices array, $C$, to store a list of neighbor vertices and a
row-offsets array, $R$, to store the offset of the neighbor list for each
vertex. This representation enables us to use parallel primitives such as
prefix sum to reorganize sparse and uneven workloads into dense and uniform
ones in all phases of graph processing.

\subsection{Graph Primitives}
To solve the issue of irregular work distribution, we design two graph
primitives: \emph{graph traversal} and \emph{filtering}. Graph traversal
starts with a queue of vertices we call the \emph{frontier}. Traversal then
takes several iterations to advance toward other vertices by visiting the
neighbor list of all vertices in the current frontier in parallel and
adding the neighbor vertices to the new frontier. Filtering starts with a
frontier that contains either edges or vertices, does a user-defined
validation test for every item in the frontier in parallel, then generates
a new frontier containing only the items that have passed the validation
test.

In PPR, graph traversal and filtering alternate until all the rank values
converge and there are no vertices in the frontier for the graph traversal
primitive. In Money, we use the graph traversal primitive to bounce back
and forth between two disjoint sets in a bipartite graph: the CoT and the
union of all CoT vertices' neighbor lists. Since in graph traversal
primitives, the neighbor lists can vary greatly in size, traversing these
neighbor lists in parallel efficiently and in a load-balanced way is
critical for the performance of our system. Merrill et
al.~\cite{Merrill:2012:SGG} uses specialized strategies according to
neighbor list size in the context of a parallel breadth-first search (BFS)
algorithm. The algorithm efficiently reduces the amount of overhead within
each kernel and better utilizes the GPU\@. Our implementation extends this
strategy with two important improvements. First, we add inline device
functions to perform user-specific computations and to reuse graph
traversal primitives for both PPR and Money by just replacing the function
we load when visiting edges and vertices. Second, BFS's operations on
vertices are idempotent, but ours are not, so we guarantee the correctness
of both algorithms by adding atomics to resolve race conditions between
edges converging on a vertex.

\subsection{Application to Other Graph Algorithms}
The graph traversal primitive distributes the parallel workload per edge or
per node in a graph to tens of thousands of threads on the GPU to process
in parallel, and the filtering primitive reorganizes the elements we want
to process in parallel for the next iteration. By reusing these two
primitives, we have implemented several graph-traversal-based ranking
algorithms such as PageRank, PPR, Money, HITS, and SALSA with minimal
programming cost. Several of these algorithms run on bipartite graphs,
which are key abstractions in several ranking and link prediction
algorithms. As far as we know, we are the first to target bipartite graph
algorithms on the GPU\@.

We implement bipartite graphs in the following way. A directed bipartite
graph is similar to a normal undirected graph, except in a directed
bipartite graph, we need to consider the outgoing edges and the incoming
edges of a vertex separately. We achieve this by reversing the source and
destination vertex ID for each edge while constructing the CSR data
structure to record the incoming edges and using an additional prefix sum
pass to compute the incoming degree for each vertex. In our graph
primitives, we also added a feature to switch between visiting the outgoing
edges of a vertex and visiting the incoming edges of a vertex. We use this
method in our SALSA implementations. In our Money and HITS implementation,
we do not need to calculate the in-degree of every node in the graph, but
just the ones connected to the CoT\@. We take advantage of this by finding
the incoming degree values for neighbors of the CoT on the fly with an
additional pass of the graph traversal primitive. Because of the small size
of the CoT, this extra pass takes negligible time but saves us gigabytes of
memory for large datasets.

\section{WTF Implementation}
In our implementation, we start by putting all the graph topology
information on the GPU\@. First, we compute the PPR value for each vertex
in the follow graph with a user-defined seed vertex. Then we radix-sort the
PPR values and take the vertices with the top $k$ PPR values (in our
implementation, $k=1000$) as the CoT and put them in a frontier. We then
run the Money algorithm, sort the vertices that the CoT follows, and
finally extract the vertices with the top relevance scores to use for our
recommendation.

\subsection{Personalized PageRank} Algorithm~\ref{alg:ppr} shows our implementation of PPR using the graph
primitives and functors we design. We update PPR using the following
equation:
\begin{equation}
  \textit{PPR}(v_i) = \left\{
  \begin{array}{l l}
    (1-\delta) + \delta \cdot \!\!\!\!\!\sum\limits_{v_j \in \textit{pred}(v_i)} \!\!\frac{\textit{PPR}(v_j)}{d_\text{OUT}(v_j)} & \text{$v_i$ is seed} \\
    \delta \cdot \sum\limits_{v_j \in \textit{pred}(v_i)} \frac{\textit{PPR}(v_j)}{d_\text{OUT}(v_j)}                            & \text{otherwise}
  \end{array}
  \right.
\end{equation}
where $\delta$ is a constant damping factor (typically set to 0.85),
$d_\text{OUT}(v_j)$ is the number of outbound edges on vertex $v_j$, and
$N$ is the total number of vertices in the graph.

The PPR algorithm starts by initializing problem data
(line~\ref{alg:ppr:problemdata}). It assigns the initial rank value for
each vertex as $\frac{1}{N}$, and puts all vertices in the initial
frontier. The main loop of the algorithm contains two steps. For each
iteration, we first use GraphTraversal to visit all the edges for each
vertex in the frontier and distribute each vertex's PPR value to its
neighbors (line~\ref{alg:ppr:distribute}). Then we use Filtering to update
the PPR value for the vertices that still have unconverged PPR values. We
give the seed vertex an extra value (line~\ref{alg:ppr:seed}) as in the
equation. Then we remove the vertices whose PPR values have converged from
the frontier (line~\ref{alg:ppr:filter}). The algorithm ends when all the
PPR values converge and the frontier is empty.

\begin{algorithm}
  \caption{Personalized PageRank} \label{alg:ppr}
  \begin{algorithmic}[1]
    \Procedure{Set\_Problem\_Data}{$G, P, \textit{seed}, \textit{delta}$} \label{alg:ppr:problemdata}
      \State $P.\textit{ranks\_curr}[1..G.\textit{verts}] \gets \frac{1}{N}$
      \State $P.\textit{src} \gets \textit{seed}, P.\textit{delta} \gets \textit{delta}$
      \State $P.\textit{frontier.Insert}(G.\textit{nodes})$
    \EndProcedure

    \Procedure{DistributePPRValue}{$s\_\textit{id}, d\_\textit{id}, P$}
      \State $\textit{atomicAdd}(P.\textit{rank\_next}[d\_\textit{id}],\frac{P.\textit{rank\_curr}[s\_\textit{id}]}{P.\textit{out\_degree}[s\_\textit{id}]})$ \label{alg:ppr:distribute}
    \EndProcedure

    \Procedure{UpdatePPRValue}{$\textit{node}\_\textit{id}, P$}
      \State $P.\textit{rank}\_\textit{next}[\textit{node}\_\textit{id}] \gets (P.\textit{delta} \cdot P.\textit{rank}\_\textit{next}[\textit{node}\_\textit{id}]) + (P.\textit{src} == \textit{node}\_\textit{id}) ? (1.0 - P.\textit{delta}) : 0$ \label{alg:ppr:seed}
      \State $\textit{diff} \gets \textit{fabs}(P.\textit{rank\_next}[\textit{node\_id}]-P.\textit{rank\_curr}[\textit{node}\_\textit{id}])$
      \State \textbf{return} $\textit{diff} > P.\textit{threshold}$ \label{alg:ppr:filter}
    \EndProcedure

    \Procedure{Compute\_PPR}{$G,P,\textit{seed}, \textit{delta}, \textit{max\_iter}$}
      \State \Call {Set\_Problem\_Data}{$G,P,\textit{seed}, \textit{delta}$}
      \While {$P.\textit{frontier.Size}() > 0$}
        \State \Call {GraphTraversal}{$G,P,\textit{DistributePPRValue}$}
        \State \Call {Filtering}{$G,P,\textit{UpdatePPRValue}$}
        \State $swap(P.\textit{rank\_next}, P.\textit{rank\_curr})$
      \EndWhile
    \EndProcedure
  \end{algorithmic}
\end{algorithm}

\subsection{Twitter's Money Algorithm} Algorithm~\ref{alg:money} shows our implementation of Twitter's Money
algorithm. We treat people in the CoT and people they follow as two
disjoint sets $X$ and $Y$ in the bipartite graph $G = (X \cup Y, E)$ they
form. For each node in $X$ (the CoT), the Money algorithm computes its
similarity score; for each node in $Y$, the Money algorithm computes its
relevance score. We use the following equations in our implementation:
\begin{equation}
  sim(x) = \left\{
  \begin{array}{l l}
    \alpha + (1-\alpha) \cdot \sum\limits_{(x,y) \in E} \frac{\textit{relevance}(y)}{d_\text{IN}(y)} & \text{$x$ \textrm{is seed}} \\
    (1-\alpha) \cdot \sum\limits_{(x,y) \in E} \frac{\textit{relevance}(y)}{d_\text{IN}(y)}          & \text{otherwise}
  \end{array} \right.
\end{equation}
\begin{equation}
  \textit{relevance}(y) = \sum\limits_{(x,y)\in E} \frac{\textit{sim}(x)}{d_\text{OUT}(x)}
\end{equation}

In the Money algorithm, we also initialize problem data first
(line~\ref{alg:money:problemdata}). We set the similarity score for the
seed vertex to 1, and set the similarity and relevance scores for all other
vertices to 0. We then put all vertices in the CoT in the initial frontier.
The main loop of the algorithm contains two steps of GraphTraversal, for
each iteration we use the first GraphTraversal to visit all the edges for
each vertex in the CoT and distribute each vertex's similarity score to its
neighbors' relevance scores (line~\ref{alg:money:relevance}). After this
step, we update the relevance scores so that they may be used in the
computation of similarity scores (line~\ref{alg:money:swaprel}). The second
GraphTraversal will visit all the edges for each vertex in the CoT again,
but this time it will distribute the neighbors' relevance scores back to
each CoT vertex's similarity score (line~\ref{alg:money:similarity}).
Again, we treat the seed vertex differently so that it can get more
similarity score during each iteration, which in turn will affect its
neighbors' relevance scores and other CoT vertices' similarity scores.
After the second GraphTraversal step, we update the similarity scores
(line~\ref{alg:money:swapsim}). As Twitter does in their Money
algorithm~\cite{Goel:2014:TWT}, we end our main loop after $1/\alpha$
iterations.

\begin{algorithm}
  \caption{Twitter's Money Algorithm} \label{alg:money}
  \begin{algorithmic}[1]
    \Procedure{Set\_Problem\_Data}{$G, P, \textit{seed}, \textit{alpha}$}\label{alg:money:problemdata}
      \State $P.\textit{relevance\_curr}[1..G.\textit{verts}] \gets 0$
      \State $P.\textit{sim\_curr}[1..G.\textit{verts}] \gets 0$
      \State $P.\textit{src} \gets \textit{seed}, P.\textit{alpha} \gets \textit{alpha}$
      \State $P.\textit{sim\_curr}[\textit{seed}] \gets 1$
      \State $P.\textit{frontier.Insert}(G.\textit{cot\_queue})$
    \EndProcedure

    \Procedure{DistributeRelevance}{$s\_\textit{id}, d\_\textit{id}, P$}
      \State $atomicAdd(P.\textit{relevance\_next}[d\_\textit{id}],\frac{P.\textit{sim\_curr}[s\_\textit{id}]}{P.\textit{out\_degree}[s\_\textit{id}]})$ \label{alg:money:relevance}
    \EndProcedure

    \Procedure{DistributeSim}{$s\_id, d\_id, P$}
      \State $val \gets (1-P.\textit{alpha}) \cdot \frac{P.\textit{relevance\_curr}[d\_\textit{id}]}{P.\textit{in\_degree}[d\_\textit{id}]} + (P.\textit{src} == s\_\textit{id}) ? (\frac{P.\textit{alpha}}{P.\textit{out\_degree}[s\_\textit{id}]}) : 0$    \label{alg:money:similarity}
      \State $\textit{atomicAdd}(P.\textit{sim\_next}[d\_\textit{id}], \textit{val})$
    \EndProcedure

    \Procedure{Compute\_Money}{$G,P,\textit{seed}, \textit{alpha}$}
      \State \Call {Set\_Problem\_Data}{$G,P,\textit{seed}, \textit{alpha}$}
      \For {$iter\text{\raisebox{.4ex}{\tiny{++}}} < 1/\textit{alpha}$}
        \State \Call {GraphTraversal}{$G,P,\textit{DistributeRelevance}$}
        \State $swap(P.\textit{relevance\_next}, P.\textit{relevance\_curr})$ \label{alg:money:swaprel}
        \State \Call {GraphTraversal}{$G,P,\textit{DistributeSim}$}
        \State $swap(P.\textit{sim\_next}, P.\textit{sim\_curr})$ \label{alg:money:swapsim}
      \EndFor
    \EndProcedure
  \end{algorithmic}
\end{algorithm}

\section{Experiments}\label{sec:experiment}
We ran all experiments in this paper on a Linux workstation with 2 $\times$
2.53~GHz Intel 4-core E5630 Xeon CPUs, 12~GB of main memory, and an NVIDIA
K40c GPU with 12~GB on-board memory. The parallel programs were compiled
with NVIDIA's nvcc compiler (version~6.0.1) with the -O3 flag. The
sequential programs were compiled using gcc 4.6.3 with the -O3 flag. The
datasets used in our experiments are shown in Table~\ref{tab:dataset}, and
Table~\ref{tab:runtimes} shows the runtimes of our GPU recommendation
system on these datasets. Runtimes are for GPU computation only and do not
include CPU-GPU transfer time; we discuss why in
Section~\ref{sec:discussion}. The wiki-Vote dataset is a real social graph
dataset that contains voting data for Wikipedia administrators; all other
datasets are follow graphs from Twitter and Google
Plus~\cite{Leslovec:2014:SLN,Kwak:2010:WIT}. Twitter09 contains the
complete Twitter follow graph as of 2009; we extract 75\% of its user size
and 50\% of its social relation edge size to form a partial graph that can
fit into the GPU's memory.

\begin{table}
  \centering
  \caption{Experimental Datasets\label{tab:dataset}}
  \begin{tabular}{*{6}{l}}
    \toprule Dataset   & Vertices & Edges  \\
    \midrule wiki-Vote & 7.1k     & 103.7k \\
    twitter-SNAP       & 81.3k    & 2.4M   \\
    gplus-SNAP         & 107.6k   & 30.5M  \\
    twitter09          & 30.7M    & 680M   \\
    \bottomrule
  \end{tabular}
\end{table}

\subsection{Scalability}

\begin{table}
  \centering
  \caption{Runtimes for Different Graph Sizes\label{tab:runtimes}}
  \begin{tabular}{*{6}{r}}
    \toprule Time (ms) & wiki-Vote & twitter & gplus & twitter09 \\
    \midrule PPR       & 0.45      & 0.84    & 4.74  & 832.69    \\
    CoT                & 0.54      & 1.28    & 2.11  & 51.61     \\
    Money              & 2.70      & 5.16    & 18.56 & 158.37    \\
    Total              & 4.37      & 8.36    & 26.57 & 1044.99   \\
    \bottomrule
  \end{tabular}
\end{table}

In order to test the scalability of our WTF-on-GPU recommendation system,
we ran WTF on six differently-sized subsets of the twitter09 dataset. The
results are shown in Figure~\ref{fig:runtime-vs-edge-count}. We see that
the implementation scales sublinearly with increasing graph size. As we
double the graph size, the total runtime increases by an average of 1.684x,
and the runtime for Money increases by an average of 1.454x. The reason
lies in our work-efficient parallel implementation. By doing per-vertex
computation exactly once and visiting each edge exactly once, our parallel
algorithm performs linear $O(m+n)$ work. The reason that we have better
scalability for the Money algorithm is that although we are doubling the
graph size each time, the CoT size is fixed at 1000. We address scalability
beyond a single GPU in Section~\ref{sec:discussion}.

\begin{figure}
  \centering
  \caption{Scalability of runtime versus edge count for our GPU recommendation
    system.\label{fig:runtime-vs-edge-count}}
\end{figure}

\subsection{Comparison to Cassovary}
We chose to use the Cassovary graph library for our CPU performance
comparison. The results of this comparison are shown in
Table~\ref{tab:cassovary}. Cassovary is a graph library developed at
Twitter. It was the library Twitter used in their first WTF
implementation~\cite{Gupta:2013:WTW}. The parameters and Cassovary function
calls for this implementation can be found in the Appendix.
\begin{table*}
  \centering
  \caption{Runtime comparison to Cassovary.\label{tab:cassovary}}
  \begin{tabular}{*{10}{r}}
    \toprule                & \multicolumn{2}{c}{wiki-Vote} & \multicolumn{2}{c}{twitter} & \multicolumn{2}{c}{gplus}  & \multicolumn{2}{c}{twitter09}                                           \\
    \midrule Step (runtime) & Cassovary                     & GPU                         & Cassovary                  & GPU                           & Cassovary & GPU   & Cassovary & GPU     \\
    \midrule PPR (ms)       & 418                           & 0.45                        & 480                        & 0.84                          & 463       & 4.74  & 884       & 832.69  \\
    CoT (ms)                & 262                           & 0.54                        & 2173                       & 1.28                          & 25616     & 2.11  & 2192      & 51.61   \\
    Money/SALSA (ms)        & 357                           & 2.70                        & 543                        & 5.16                          & 2023      & 18.56 & 11216     & 158.37  \\
    Total (ms)              & 1037                          & 4.37                        & 3196                       & 8.36                          & 28102     & 26.57 & 14292     & 1044.99 \\
    Speedup                 & \multicolumn{2}{r}{235.7}     & \multicolumn{2}{r}{380.5}   & \multicolumn{2}{r}{1056.5} & \multicolumn{2}{r}{13.7}                                                \\
    \bottomrule
  \end{tabular}
\end{table*}

We achieve speedups of up to 1000x over Cassovary for the Google Plus
graph, and a speedup of 14x for the 2009 Twitter graph, which is the most
representative dataset for the WTF application. One difference between the
GPU algorithm and the Cassovary algorithm is that we used the SALSA
function that comes with the Cassovary library, instead of using Twitter's
Money algorithm for the final step of the algorithm. Both are ranking
algorithms based on link analysis of bipartite graphs, and in the original
Who-To-Follow paper~\cite{Gupta:2013:WTW}, Gupta et al.\ use a form of
SALSA for this step, so this is reasonable for a comparison.

\section{Discussion}
\label{sec:discussion} On a graph with more than 175 million nodes and 20 billion edges, the WTF
algorithm currently takes around 500~ms in Twitter's data
center~\cite{Myers:2014:INS}. In contrast, our implementation can process a
graph with 25 million nodes and 340 million edges in a similar amount of
time (524~ms), and it takes 1 second to process our largest dataset, which
is still significantly smaller than the complete Twitter graph.

The Personalized PageRank calculation takes up the vast majority of the
runtime for larger graphs (Table~\ref{tab:runtimes}). This is because PPR
runs on the entire graph, but Money only runs on the pruned CoT graph,
which does not grow as quickly. One possible way to reduce the runtime
would be to precompute the CoT, and only run Money to update WTF\@. In this
scheme, PPR would only be run periodically, and it could be an offline
process. Alternatively, an incremental PPR calculation, as in Bahmani et
al.~\cite{Bahmani:2010:FIP}, could provide an estimate of the new CoT
without needing to iterate through the entire graph. With the precomputed
CoT, we would only need to run Twitter's Money algorithm to get the result.
According to our sublinear scalability model of Twitter's Money algorithm
runtime, we could compute the result in 300~ms on Twitter's circa-2009
follow graph.

We also note that our PPR calculation uses a power iteration method, while
Twitter's is a Monte Carlo method. Both methods are quite accurate, but
there are a few trade-offs~\cite{Avrachenkov:2007:MCM}. The power iteration
method is deterministic. Every time the algorithm runs on the same set of
data, we will get the same results. Monte Carlo methods involve actually
performing random walks to compute the probability distribution, so the
outcome will not be exactly the same every time. In terms of performance,
Monte Carlo methods give a reasonable approximation after only the first
iteration, but the error decreases very slowly with more iterations. Power
iteration, on the other hand, starts with a much more inaccurate
estimation, but the error decreases and converges quicker than for Monte
Carlo. We chose power iteration because it is both efficient and easy to
implement on a GPU; however, it is possible that a Monte Carlo method would
perform better on a GPU\@. Because Monte Carlo methods involve many
independent walks, they are potentially well-suited to the GPU's massively
parallel architecture.

The difference between Monte Carlo and iterative methods can be seen in the
camparison with the Cassovary runtimes in Table~\ref{tab:cassovary}. While
our GPU version is PPR-limited, the Cassovary implementation tends to be
SALSA-limited. This is because of the difference in the way we compute
Personalized PageRank. The Cassovary library PPR function uses the Monte
Carlo random walks method, walking the graph and maintaining a visit count
at each node, then ranking nodes by the number of visits. The number of
steps is fixed, so the Cassovary PPR scales well. Our method iterates until
convergence and computes the PPR for every node, not just ones that are in
the vicinity of the start node.

One major limitation of our current implementation is the size of GPU
memory. Today's GPUs have at most 12~GB of memory, whereas a CPU system can
easily have ten times as much memory. This means that the entire current
Twitter follow graph cannot fit in GPU memory. The twitter09 dataset is
close to the maximum size that can fit on a single GPU\@. To compute WTF on
the full graph, we would need to partition the graph and/or distribute it
across multiple nodes. Scaling any large-scale parallel data analysis
system, especially online social network system, beyond a single GPU
remains a major challenge today. The unstructured and highly irregular
connectivity of power-law graphs like social graphs makes it difficult to
design partitioning and synchronization strategies for such graphs.

Another limitation is the bandwidth of the connection between main memory
and GPU memory. Results in Section~\ref{sec:experiment} do not include the
data transfer time from CPU memory to the GPU\@. For our largest graph, the
transfer time is 852.24~ms---about 85\% of the compute time. In our
experiments, we assume that graph data will be resident on the GPU, because
it will be used to run a variety of algorithms on the follow graph, so
transfer time will not be significant overall; however, for a different use
case or very frequent updates to the graph, data transfer time could
seriously limit performance.

Fortunately, future GPU systems have potential solutions for the modest
size of today's GPU memories. The next node on NVIDIA's GPU roadmap is
``Pascal'' (2016), which can be connected to the CPU via a high-speed
``NVLink'' connection that allows access to the CPU's main memory at
CPU-main-memory speeds (as well as supporting unified virtual memory across
CPU and GPU\@). While such systems will still require careful memory
management, they eliminate the current performance disadvantage in the case
data fits in CPU memory but not GPU memory.

\section{Conclusions and Future Work} In this work, we have shown that it is possible to use a GPU for
recommendation algorithms on social graphs, but there are still many ways
in which the performance could be improved. Software platforms for
large-scale online social network analysis on hybrid CPU-GPU architectures
could potentially offer better throughput and performance on systems that
are more cost-effective than today's CPU-based cluster architectures.
However, moving workloads to GPUs is challenging for the following reasons:

\begin{itemize}
  \item Designing parallel algorithms for such systems is both difficult and time
        consuming.
  \item Today, limited PCIe bandwidth is a constraint for using discrete GPUs for
        communication-bounded tasks.
  \item The limited memory size on current GPUs makes it difficult to run
        algorithms on datasets that cannot fit in the GPU memory.
\end{itemize}

With the appearance of more graph processing libraries on the GPU, the
design and implementation of online social network analysis software on the
GPU is becoming easier and more efficient. Today, most online social
network analysis algorithms are still compute-bound when running on large
datasets. This makes a multi-CPU/multi-GPU architecture running across
multiple nodes a promising solution.

We propose a two-layer framework running on a three-layer memory hierarchy
where GPU memory serves as the fast cache, and CPU main memory serves as
the second level cache, sitting atop data stored on hard disks/SSDs. In our
case of building a recommendation system, the entire follow graph will be
stored in CPU memory with disk as backing store. Multiple nodes, each
containing one or more GPUs, will store the CoT for each user in the graph.
Currently Twitter has 250 million users. If we keep the size of CoT at 1000
and use unsigned integers for vertex IDs, then we will need 2~TB to store
CoTs for all users. That can be easily partitioned by user ID to fit on 4
or more machines with 512~GB or less of main memory. In this case, we can
run PPR as an offline algorithm that runs only once per day or after a
certain number of graph updates. The recommendation system can then work in
real time with around 100~ms running time. Because the number of vertices
in CoT is a constant 1000, the pruned graph that contains only vertices in
CoT and vertices they connect to will be much smaller than the original
follow graph. This set-up would reduce both the computational workload and
GPU memory requirements.

\bibliographystyle{abbrv} %\bibliography{wtf}
\begin{thebibliography}{10}

  \bibitem{Avrachenkov:2007:MCM}
  K.~Avrachenkov, N.~Litvak, D.~Nemirovsky, and N.~Osipova.
  \newblock {M}onte {C}arlo methods in {PageRank} computation: When one iteration is sufficient.
  \newblock {\em SIAM Journal of Numerical Analysis}, 45(2):890--904, Feb. 2007.

  \bibitem{Bahmani:2010:FIP}
  B.~Bahmani, A.~Chowdhury, and A.~Goel.
  \newblock Fast incremental and personalized {PageRank}.
  \newblock {\em Proceedings of the VLDB Endowment}, 4(3):173--184, Dec. 2010.

  \bibitem{Goel:2014:TWT}
  A.~Goel.
  \newblock The ``who-to-follow'' system at {T}witter: Algorithms, impact, and further research.
  \newblock WWW 2014 industry track, 2014.

  \bibitem{Gupta:2013:WTW}
  P.~Gupta, A.~Goel, J.~Lin, A.~Sharma, D.~Wang, and R.~Zadeh.
  \newblock {WTF}: The who to follow service at {T}witter.
  \newblock In {\em Proceedings of the International Conference on the World Wide Web}, pages 505--514, May 2013.

  \bibitem{Kleinberg:1999:ASH}
  J.~M. Kleinberg.
  \newblock Authoritative sources in a hyperlinked environment.
  \newblock {\em Journal of the ACM}, 46(5):604--632, Sept. 1999.

  \bibitem{Kwak:2010:WIT}
  H.~Kwak, C.~Lee, H.~Park, and S.~Moon.
  \newblock {W}hat is {T}witter, a social network or a news media?
  \newblock In {\em Proceedings of the International Conference on the World Wide Web}, pages 591--600, Apr. 2010.

  \bibitem{Lempel:2001:STS}
  R.~Lempel and S.~Moran.
  \newblock {SALSA}: The stochastic approach for link-structure analysis.
  \newblock {\em ACM Transactions on Information Systems}, 19(2):131--160, Apr. 2001.

  \bibitem{Leslovec:2014:SLN}
  J.~Leskovec.
  \newblock {SNAP}: Stanford large network dataset collection.
  \newblock \url{http://snap.stanford.edu/data/}.
  \newblock Accessed: 2014-05-18.

  \bibitem{Merrill:2012:SGG}
  D.~Merrill, M.~Garland, and A.~Grimshaw.
  \newblock Scalable {GPU} graph traversal.
  \newblock In {\em Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming}, PPoPP '12, pages 117--128, Feb. 2012.

  \bibitem{Myers:2014:INS}
  S.~A. Myers, A.~Sharma, P.~Gupta, and J.~Lin.
  \newblock Information network or social network?: The structure of the {T}witter follow graph.
  \newblock In {\em Proceedings of the Companion Publication of the International Conference on the World Wide Web}, WWW Companion '14, pages 493--498, Apr. 2014.

  \b  ibitem{Page:1999:TPC}
  L.~Page, S.~Brin, R.~Motwani, and T.~Winograd.
  \newblock The {PageRank} citation ranking: Bringing order to the web.
  \newblock Technical Report 1999-66, Stanford InfoLab, Nov. 1999.

  \bibitem{Rungsawang:2012:FPC}
  A.~Rungsawang and B.~Manaskasemsak.
  \newblock Fast {PageRank} computation on a {GPU} cluster.
  \newblock In {\em Proceedings of the 2012 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)}, pages 450--456, Feb. 2012.

  \bibitem{Srinivasa:2013:GIO}
  K.~G. Srinivasa, K.~Mishra, C.~S. Prajeeth, and A.~M. Talha.
  \newblock {GPU} implementation of friend recommendation system using {CUDA} for social networking services.
  \newblock In {\em Proceedings of the International Conference on Emerging Research in Computing, Information, Communication, and Applications}, pages  890--895, Aug. 2013.

  \bibitem{Wu:2010:EPS}
  T.~Wu, B.~Wang, Y.~Shan, F.~Yan, Y.~Wang, and N.~Xu.
  \newblock Efficient {PageRank} and {SpMV} computation on {AMD} {GPU}s.
  \newblock In {\em Proceedings of the 39th International Conference on Parallel   Processing}, pages 81--89, Sept. 2010.

\end{thebibliography}

\appendix
Here are the parameters and function calls used in our comparison using the
Cassovary graph library. Our inputs for the personalized PageRank
computation are as follows:

\begin{small}
  \begin{verbatim}
val numSteps = 100L * 1000L
val resetProb = 0.15
val maxSteps = None
val pathsSaved = Some(2)
val walkParams = RandomWalkParams(numSteps, resetProb, maxSteps, pathsSaved)
val graphUtils = new GraphUtils(graph)
val (topNeighbors, paths) = graphUtils.calculatePersonalizedReputation(startNode, walkParams)
\end{verbatim}
\end{small}

\noindent
Parameters for the SALSA calculation:

\begin{small}
  \begin{verbatim}
val leftResetProb = 0.2
val rightResetProb = 0
val numTopContributors = 5
val SALSA = new IterativeLinkAnalyzer(bipartiteGraphUtils, leftResetProb, rightResetProb, numTopContributors)
val numIterations = 5
val (topSimilarities, topRelevances) =
  SALSA.analyze(leftNodeInfo, numIterations, {LHSNodes => LHSNodes.neighborIds(OutDir)})
\end{verbatim}
\end{small}

\end{document}
